Williams-Baird Counterexample for Q-Factor Asynchronous Policy Iteration
Abstract
A counterexample due to Williams and Baird [WiB93] (Example 2 in their paper) is transcribed here in the context and notation of two papers by Bertsekas and Yu [BeY10a], [BeY10b], and is also adapted to the case of Q-factor-based policy iteration. The example illustrates that cycling is possible in asynchronous policy iteration if the initial policy and cost/Q-factor iterates do not satisfy a certain monotonicity condition, under which Williams and Baird [WiB93] show convergence. The papers [BeY10a], [BeY10b] show how asynchronous policy iteration can be modified to circumvent the difficulties illustrated in this example. The purpose of the transcription given here is to facilitate understanding of this theoretically interesting example, thereby providing motivation for, and an illustration of, the methods proposed in [BeY10a], [BeY10b]. The example has not been altered in any material way.
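To make the setting concrete, the following is a minimal Python sketch of asynchronous policy iteration with Q-factors, in the cost-minimization convention of [BeY10a], [BeY10b]. The two-state MDP data, discount factor, and random update schedule are illustrative assumptions only and do not reproduce the Williams-Baird example itself; the sketch only shows the two asynchronous operations (local policy evaluation of a single Q-factor, and local policy improvement at a single state) whose unrestricted interleaving is the source of the cycling discussed in the abstract.

import numpy as np

# Minimal sketch (NOT the Williams-Baird example): asynchronous policy
# iteration with Q-factors for a tiny discounted MDP, cost-minimization
# convention.  All numerical data below are illustrative assumptions.

n_states, n_controls, alpha = 2, 2, 0.9          # discount factor alpha
# p[u][i][j]: transition probability from state i to j under control u (assumed data)
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
# g[u][i]: expected one-stage cost at state i under control u (assumed data)
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])

Q = np.zeros((n_states, n_controls))             # initial Q-factors
mu = np.zeros(n_states, dtype=int)               # initial policy

def evaluate(i, u):
    """Asynchronous policy-evaluation update of one Q-factor:
    Q(i,u) <- g(i,u) + alpha * sum_j p_ij(u) Q(j, mu(j))."""
    Q[i, u] = g[u, i] + alpha * p[u, i] @ Q[np.arange(n_states), mu]

def improve(i):
    """Asynchronous policy-improvement update at one state:
    mu(i) <- argmin_u Q(i,u)."""
    mu[i] = int(np.argmin(Q[i]))

# An arbitrary interleaving of the two operations.  Without the monotonicity
# condition of Williams and Baird on the initial policy and Q-factors,
# such interleavings need not converge and can cycle.
rng = np.random.default_rng(0)
for _ in range(50):
    i = rng.integers(n_states)
    if rng.random() < 0.5:
        evaluate(i, rng.integers(n_controls))
    else:
        improve(i)

print("Q-factors:\n", Q)
print("policy:", mu)

Under the monotonicity condition of Williams and Baird (roughly, that the initial Q-factors dominate the result of a policy-evaluation update for the initial policy), such interleavings converge; the modifications proposed in [BeY10a], [BeY10b] alter the evaluation operation so that convergence holds without this condition.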
Similar References
Q-learning and policy iteration algorithms for stochastic shortest path problems
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1...
LIDS REPORT 2871: Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in [BY10b]. The main difference from the s...
Analysis of Some Incremental Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic Learning Systems
This paper studies algorithms based on an incremental dynamic programming abstraction of one of the key issues in understanding the behavior of actor-critic learning systems. The prime example of such a learning system is the ASE/ACE architecture introduced by Barto, Sutton, and Anderson (1983). Also related are Witten's adaptive controller (1977) and Holland's bucket brigade algorithm (1986). ...
Off-Policy Temporal-Difference Learning with Function Approximation
We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal, le...
متن کاملOff-Policy Temporal Difference Learning with Function Approximation
We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goa...
Publication date: 2010